Run ggplot(data = mpg). What do you see?
ggplot(data = mpg)
An empty plot: ggplot() only sets up the plotting area, and no geom layers have been added yet.
How many rows are in mpg? How many columns?
nrow(mpg)
## [1] 234
ncol(mpg)
## [1] 11
There are 234 rows and 11 columns.
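Both counts can also be read in a single call. This is a minimal sketch, assuming ggplot2 is installed so that the mpg dataset is available; the name mpg_dims is only illustrative.

```r
library(ggplot2)  # provides the mpg dataset

# dim() returns the number of rows and columns as one integer vector
mpg_dims <- dim(mpg)
mpg_dims
```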
What does the drv variable describe? Read the help for ?mpg to find out.
The drv variable describes which wheels of the car receive power from the engine (f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive).
Make a scatterplot of hwy vs cyl.
ggplot(data = mpg) +
geom_point(mapping = aes(x = hwy, y = cyl))
What happens if you make a scatterplot of class vs drv? Why is the plot not useful?
ggplot(data = mpg) +
geom_point(mapping = aes(x = class, y = drv))
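The plot is not useful because both variables are categorical: points can only fall on a small grid of class and drv combinations, and all cars that share a combination are drawn on top of each other, so the plot hides how many cars are in each one.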
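To make the overplotting concrete, the combinations can be counted directly. The sketch below is one possible way to do that, not part of the original exercise; combo_counts, occupied, n_positions, and count_plot are illustrative names. It also shows geom_count(), which scales each point by the number of cars at that position.

```r
library(ggplot2)

# Count how many cars fall on each class/drv combination
combo_counts <- as.data.frame(table(class = mpg$class, drv = mpg$drv))
occupied <- combo_counts[combo_counts$Freq > 0, ]

# The scatterplot can show at most this many distinct positions
n_positions <- nrow(occupied)

# One option: scale point area by the number of cars at each position
count_plot <- ggplot(data = mpg) +
  geom_count(mapping = aes(x = class, y = drv))
```

Comparing n_positions with nrow(mpg) shows how many observations are hidden behind each visible point.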
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
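The points are not blue because color = "blue" is inside aes(), where ggplot2 treats "blue" as a one-value variable to map to a color rather than as the color to draw with. To set an aesthetic manually, place it outside aes(), as an argument to the geom function.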
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
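The difference can be checked without looking at the plots. The sketch below uses ggplot_build(), which returns the data each layer will draw; the object names are illustrative, and reading the colour column from the built data is an assumption about the structure of that return value.

```r
library(ggplot2)

inside_aes <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
outside_aes <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# ggplot_build() exposes the values each layer will actually draw
colour_inside <- unique(ggplot_build(inside_aes)$data[[1]]$colour)
colour_outside <- unique(ggplot_build(outside_aes)$data[[1]]$colour)

colour_inside   # a colour picked by the discrete colour scale
colour_outside  # the colour that was set manually
```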
Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?
manufacturer, model, trans, drv, fl, and class are categorical, while displ, year, cyl, cty, and hwy are continuous. You can see this information by running mpg and looking at the variable types under the column headings (chr indicates categorical, while dbl or int indicates continuous).
mpg
## # A tibble: 234 x 11
##    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl
##    <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr>
##  1 audi         a4        1.80  1999     4 auto(l… f        18    29 p
##  2 audi         a4        1.80  1999     4 manual… f        21    29 p
##  3 audi         a4        2.00  2008     4 manual… f        20    31 p
##  4 audi         a4        2.00  2008     4 auto(a… f        21    30 p
##  5 audi         a4        2.80  1999     6 auto(l… f        16    26 p
##  6 audi         a4        2.80  1999     6 manual… f        18    26 p
##  7 audi         a4        3.10  2008     6 auto(a… f        18    27 p
##  8 audi         a4 quat…  1.80  1999     4 manual… 4        18    26 p
##  9 audi         a4 quat…  1.80  1999     4 auto(l… 4        16    25 p
## 10 audi         a4 quat…  2.00  2008     4 manual… 4        20    28 p
## # ... with 224 more rows, and 1 more variable: class <chr>
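The same split can be produced programmatically from the column types. This is a minimal sketch that follows the answer's convention of treating character columns as categorical and numeric columns as continuous; is_chr, categorical_vars, and continuous_vars are illustrative names.

```r
library(ggplot2)

# TRUE for columns stored as character vectors
is_chr <- vapply(mpg, is.character, logical(1))

categorical_vars <- names(mpg)[is_chr]
continuous_vars <- names(mpg)[!is_chr]

categorical_vars
continuous_vars
```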
Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = displ))
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = displ))
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = displ))
## Error: A continuous variable can not be mapped to shape
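Continuous variables are mapped to a gradient of colors or a continuous range of sizes, and cannot be mapped to shape at all (hence the error above), whereas categorical variables are separated into discrete groups, each with its own color, size, or shape (as shown below for the color aesthetic).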
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = displ, size = displ))
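ggplot2 applies every mapping: each point's color and size both encode displ, and each aesthetic gets its own legend. The encoding is redundant, since both aesthetics carry the same information.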
What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), stroke = 1, shape = 21)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), stroke = 5, shape = 21)
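The stroke aesthetic modifies the border thickness of shapes that have a border. It is most useful with shapes 21 to 25, whose border (colour) is separate from their interior (fill).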
What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = displ < 5))
ggplot2 evaluates displ < 5 for each row and treats the result as a two-level variable. Each point is colored according to whether the expression is TRUE or FALSE for that point: points with an engine displacement less than 5 get the TRUE color, and points with a displacement of 5 or more get the FALSE color.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ cty, nrow = 3)
ggplot2 draws one panel for each unique value of the continuous variable, so the number of panels can become very large.
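One way to keep the panel count manageable is to bin the continuous variable before faceting. The sketch below uses ggplot2's cut_width() helper; the bin width of 5, the column name cty_bin, and the other object names are arbitrary choices for illustration.

```r
library(ggplot2)

# Number of panels produced by faceting on the raw values
n_cty_values <- length(unique(mpg$cty))

# Bin cty into intervals 5 mpg wide, then facet on the bins instead
mpg_binned <- as.data.frame(mpg)
mpg_binned$cty_bin <- cut_width(mpg_binned$cty, width = 5)
n_bins <- nlevels(mpg_binned$cty_bin)

binned_plot <- ggplot(data = mpg_binned) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ cty_bin, nrow = 2)
```

Comparing n_cty_values with n_bins shows how many panels the binning saves.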
What do the empty cells in the plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?
ggplot(data = mpg) +
geom_point(mapping = aes(x = drv, y = cyl))
ggplot(data = mpg) +
geom_point(mapping = aes(x = drv, y = cyl)) +
facet_grid(drv ~ cyl)
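The empty cells indicate that there are no data points with that particular combination of drv and cyl values (for example, there are no cars with 4 cylinders that have rear-wheel drive). They correspond to the grid positions that have no point in the unfaceted scatterplot of drv against cyl.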
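A cross-tabulation gives the same information as numbers: every zero in the table corresponds to an empty facet. This is a small sketch using base R's table(); combos and empty_cells are illustrative names.

```r
library(ggplot2)

# Rows are drive trains, columns are cylinder counts
combos <- table(drv = mpg$drv, cyl = mpg$cyl)
combos

# Number of drv/cyl combinations with no cars, i.e. empty facets
empty_cells <- sum(combos == 0)
```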
What plots does the following code make? What does . do?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)
. is used to facet in a single dimension. facet_grid(drv ~ .) results in an N x 1 grid (one row per value of drv), while facet_grid(. ~ cyl) results in a 1 x N grid (one column per value of cyl), where N is the number of unique values of the variable.
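The grid shapes can be confirmed from the built plots. The sketch below reads the panel layout table from ggplot_build(); relying on its ROW and COL columns is an assumption about ggplot2's built-plot structure rather than a documented interface, and all object names are illustrative.

```r
library(ggplot2)

base_plot <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# The layout table has one row per panel, with its ROW and COL position
rows_layout <- ggplot_build(base_plot + facet_grid(drv ~ .))$layout$layout
cols_layout <- ggplot_build(base_plot + facet_grid(. ~ cyl))$layout$layout

rows_dims <- c(max(rows_layout$ROW), max(rows_layout$COL))  # N x 1
cols_dims <- c(max(cols_layout$ROW), max(cols_layout$COL))  # 1 x N
```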
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
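Advantages of faceting over the colour aesthetic: each class gets its own panel, so patterns and trends within a class are easier to see, and the categories stay distinguishable however many there are.
Disadvantages: points from different classes sit in different panels, so comparing classes and seeing global trends is harder.
With a larger dataset, faceting becomes more attractive: the colour aesthetic gets less practical as points overlap and similar colours become difficult to distinguish.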
Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?
nrow sets the number of rows in the faceted plot, while ncol sets the number of columns. dir and as.table also control the layout of the individual panels: dir determines whether the panels are filled in horizontally or vertically, while as.table determines whether the highest-value facets are at the bottom-right (like a table, the default) or at the top-right (like a plot). facet_grid() does not need nrow and ncol because the number of rows and columns is implied by the faceting variables: the number of unique values of the first variable gives the number of rows, and that of the second gives the number of columns.
When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?
Putting the variable with more levels in the columns spreads the extra panels across the width of the plot. Most displays and landscape pages are wider than they are tall, so there is more room horizontally; putting that variable in the rows instead would stack many panels vertically and compress each panel's y axis.
To draw a line chart, you would use geom_line(), to draw a boxplot, geom_boxplot(), to draw a histogram, geom_histogram(), and to draw an area chart, geom_area().
The output will be a scatterplot with engine displacement on the x axis and highway miles per gallon on the y axis (negative correlation). Both the points and the smooth lines will be colored by drive train (front-wheel, rear-wheel, or four-wheel drive), with one smooth line per drive train and, because se = FALSE, no confidence bands.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'
What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter?
show.legend = FALSE prevents the layer's legend from being displayed. If you remove it, ggplot2 falls back to its default and draws a legend for any aesthetics mapped in that layer. It was used earlier in the chapter to ensure all three plots in the example had the same format.
What does the se argument to geom_smooth() do?
If se = TRUE (the default), a confidence interval is drawn around the smooth line. If se = FALSE, no confidence interval is drawn.
These graphs will look exactly the same. They use the same dataset and the same mappings; the only difference is that the first code block sets them globally, so they apply to every geom in the graph, while the second sets them locally for each layer.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess'
ggplot() +
geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess'
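The claim that global and local mappings give the same result can be checked by comparing what each plot will draw. This is a sketch using ggplot_build(); the object names are illustrative, and only the plotted x and y values are compared.

```r
library(ggplot2)

global_plot <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

local_plot <- ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

global_built <- ggplot_build(global_plot)
local_built <- ggplot_build(local_plot)

# Layer 1 holds the points, layer 2 the fitted smooth line
same_points <- isTRUE(all.equal(
  global_built$data[[1]][c("x", "y")],
  local_built$data[[1]][c("x", "y")]
))
same_smooth <- isTRUE(all.equal(
  global_built$data[[2]]$y,
  local_built$data[[2]]$y
))
```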
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth(mapping = aes(group = drv), se = FALSE)
## `geom_smooth()` using method = 'loess'
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = drv)) +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = drv)) +
geom_smooth(mapping = aes(linetype = drv), se = FALSE)
## `geom_smooth()` using method = 'loess'
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(color = "white", size = 4) +
geom_point(aes(color = drv))
What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?
The default geom associated with stat_summary() is geom_pointrange().
Rewritten:
ggplot(data = diamonds) +
geom_pointrange(mapping = aes(x = cut, y = depth),
stat = "summary",
# ggplot2 3.3.0 renamed fun.ymin, fun.ymax, and fun.y to fun.min, fun.max, and fun
fun.min = min,
fun.max = max,
fun = median
)
What does geom_col() do? How is it different to geom_bar()?
geom_col() and geom_bar() both create bar charts. With geom_bar(), the height of each bar is proportional to the number of cases in each group (it uses stat_count() to count the cases at each x position), whereas with geom_col() the heights of the bars are values already present in the data (no counting necessary, since it uses stat_identity()).
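The relationship can be illustrated by counting the cases by hand and passing the counts to geom_col(), which should reproduce the bars geom_bar() computes itself. This is a sketch under that assumption; cut_counts, the column name n, and the other object names are illustrative.

```r
library(ggplot2)

# Pre-compute the number of diamonds for each cut
cut_counts <- as.data.frame(table(cut = diamonds$cut), responseName = "n")

# geom_bar() counts the rows itself; geom_col() uses the supplied heights
bar_plot <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
col_plot <- ggplot(data = cut_counts) +
  geom_col(mapping = aes(x = cut, y = n))

bar_heights <- ggplot_build(bar_plot)$data[[1]]$count
col_heights <- ggplot_build(col_plot)$data[[1]]$y
```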
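Examples include geom_bar() and stat_count(), geom_histogram() and stat_bin(), geom_boxplot() and stat_boxplot(), geom_smooth() and stat_smooth(), and geom_density() and stat_density(). Most geom and stat pairs have similar names, and each geom in a pair uses its partner stat by default.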
What variables does stat_smooth() compute? What parameters control its behaviour?
stat_smooth() computes y (predicted value), ymin (lower pointwise confidence interval around the mean), ymax (upper pointwise confidence interval around the mean), and se (standard error). Many parameters control its behavior, including method (which smoothing method to use), se (whether to display a confidence interval), and level (which confidence level to use for the interval).
In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?
Without group = 1:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop..))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))
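We need to set group = 1 to tell ggplot2 to treat all of the data as one group when computing proportions. Otherwise, each cut is considered a separate group, every proportion is 1, and all bars have the same height. Adding fill = color does not help: the proportions are still computed within each group, so the stacked bars again all have the same height.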
With group = 1:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
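The computed proportions can be inspected directly. The sketch below uses after_stat(prop), the newer spelling of ..prop.. (assuming ggplot2 3.3.0 or later), and reads the results from ggplot_build(). The last plot is one possible fix for the filled version, dividing each count by the overall total; it is an illustration rather than the answer given here.

```r
library(ggplot2)

ungrouped <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))
grouped <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

# Without group = 1, each cut is its own group and every proportion is 1
prop_ungrouped <- ggplot_build(ungrouped)$data[[1]]$prop
# With group = 1, the proportions are shares of the whole dataset
prop_grouped <- ggplot_build(grouped)$data[[1]]$prop

# Possible fix for the filled chart: divide each count by the overall total
filled <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = color,
                         y = after_stat(count / sum(count))))
filled_built <- ggplot_build(filled)$data[[1]]
filled_heights <- filled_built$ymax - filled_built$ymin
```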
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point()
Many points appear to overlap each other (overplotting). You could improve the plot by adding jitter, which will add a small amount of random noise to each point.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_jitter()
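What parameters to geom_jitter() control the amount of jittering?
width and height control the amount of horizontal and vertical jittering, in data units. The jitter is added in both directions, so the total spread is twice the value given. The values below are deliberately large to make the effect obvious.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_jitter(width = 5, height = 10)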
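Because jitter is random, the plot changes slightly on every render. If a reproducible figure is needed, one option is position_jitter() with its seed argument. This is a sketch; the width and height of 0.4 and the seed of 42 are arbitrary values chosen for illustration.

```r
library(ggplot2)

jitter_plot <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point(
    position = position_jitter(width = 0.4, height = 0.4, seed = 42)
  )

# With a fixed seed, building the plot twice gives the same positions
x_first <- ggplot_build(jitter_plot)$data[[1]]$x
x_second <- ggplot_build(jitter_plot)$data[[1]]$x
```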
Compare and contrast geom_jitter() with geom_count().
Both geom_jitter() and geom_count() are used to manage overplotting. While geom_jitter() adds a small amount of random noise to the location of each point, geom_count() counts the number of observations at each location and maps the count to point area.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_jitter()
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_count()
What's the default position adjustment for geom_boxplot()? Create a visualisation of the mpg dataset that demonstrates it.
In current ggplot2 releases, the default for geom_boxplot() is position_dodge2(), a variant of position_dodge() that places the boxes for each group side by side.
ggplot(data = mpg, mapping = aes(x = drv, y = hwy, color = class)) +
geom_boxplot()
Looks the same:
ggplot(data = mpg, mapping = aes(x = drv, y = hwy, color = class)) +
geom_boxplot(position = "dodge2")
Do not look the same:
ggplot(data = mpg, mapping = aes(x = drv, y = hwy, color = class)) +
geom_boxplot(position = "jitter")
ggplot(data = mpg, mapping = aes(x = drv, y = hwy, color = class)) +
geom_boxplot(position = "identity")
Turn a stacked bar chart into a pie chart using coord_polar().
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), width = 1)
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), width = 1) +
coord_polar()
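With one bar per cut, coord_polar() maps the bars to wedges around the circle rather than to slices of a single pie. If a conventional pie chart is wanted, one common approach is to stack everything into a single bar and map the bar's height to the angle with theta = "y". The sketch below is one such variant, not the plot from the answer above; factor(1) is just a dummy x value, and the choice of fill = cut is illustrative.

```r
library(ggplot2)

# A single stacked bar: x is a constant, so all cuts share one bar
pie_chart <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = factor(1), fill = cut), width = 1) +
  coord_polar(theta = "y")  # map the stacked heights to angles
```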
What does labs() do? Read the documentation.
labs() is used to change axis labels, legend titles, and plot labels such as the title, subtitle, and caption.
What's the difference between coord_quickmap() and coord_map()?
coord_map() applies a full map projection, which in general does not preserve straight lines and therefore requires more computation. coord_quickmap() is a quick approximation that does preserve straight lines.
nz <- map_data("nz")
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black") +
coord_map()
nz <- map_data("nz")
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black") +
coord_quickmap()
What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline() +
coord_fixed()
There is a positive correlation between city and highway mpg. coord_fixed() ensures that one unit on the x axis is the same length as one unit on the y axis, thus the x and y values are directly comparable. geom_abline() adds a reference line to the plot (default values: intercept = 0, slope = 1).
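The reference line makes one more comparison easy: any point above it is a car whose highway mileage exceeds its city mileage. The sketch below computes that share and the correlation numerically; share_above_line and mpg_correlation are illustrative names.

```r
library(ggplot2)

# Share of cars that lie above the line hwy = cty
share_above_line <- mean(mpg$hwy > mpg$cty)

# Strength of the linear relationship between city and highway mpg
mpg_correlation <- cor(mpg$cty, mpg$hwy)

share_above_line
mpg_correlation
```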